This is the fourth project of Udacity’s Data Analyst Nanodegree program. We were given several data sets to choose from. I chose Arizona’s 2016 presidential campaign finance data from the Federal Election Commission (FEC) website.
The analysis will follow this format:
1- I will declare my intentions with a hypothesis (if applicable)
2- Insert an R code snippet and run it
3- Declare my findings.
And so on…
First, I will explore basic statistics about the data set. This will show me the nature of the data and whether it needs cleaning or wrangling. Afterwards, I will explore relationships between two or more variables, using the methods I learned in chapter 4, such as scatter plots, line plots, box plots, and histograms. This is the basic outline of the analysis, but I expect to find interesting things to talk about along the way.
The structure of the data is as follows. The file has 19 variables, and these are the most important ones for the analysis:
I wish there were party and gender columns. I will try to add them below.
# Add party col
# Note code template was taken from Udacity Forums
# Names must match az$cand_nm exactly, including punctuation
index <- c("Johnson, Gary", "Stein, Jill", "McMullin, Evan")
dindex <- c("Clinton, Hillary Rodham", "Sanders, Bernard", "Lessig, Lawrence",
            "O'Malley, Martin Joseph", "Webb, James Henry Jr.")
# "Christie, Christopher J." needs its trailing period to match the data;
# "Santorum, Richard J." appears in the data and would otherwise stay NA
rindex <- c("Bush, Jeb", "Carson, Benjamin S.",
            "Christie, Christopher J.", "Cruz, Rafael Edward 'Ted'",
            "Fiorina, Carly", "Gilmore, James S III",
            "Graham, Lindsey O.", "Huckabee, Mike",
            "Jindal, Bobby", "Kasich, John R.",
            "Paul, Rand", "Perry, James R. (Rick)",
            "Rubio, Marco", "Santorum, Richard J.",
            "Trump, Donald J.", "Walker, Scott")
# Index the data frame directly instead of using attach()/detach()
az$party <- NA_character_
az$party[az$cand_nm %in% index] <- "independent"
az$party[az$cand_nm %in% dindex] <- "democrat"
az$party[az$cand_nm %in% rindex] <- "republican"
# Convert party to factor
az$party <- factor(az$party)
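Because `%in%` matches names exactly, a candidate whose spelling differs from the lookup vectors silently ends up with an `NA` party. Below is a minimal sketch of a sanity check for that. The data frame `az_demo` and the shortened lookup vectors are made up for illustration and are not the real FEC file.

```r
# Toy stand-in for the real data; only cand_nm matters here (illustrative rows)
az_demo <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.",
              "Johnson, Gary", "Christie, Christopher J."),
  stringsAsFactors = FALSE
)

# Shortened lookup vectors, for illustration only.
# Note the missing trailing period on the Christie entry.
demo_dem <- c("Clinton, Hillary Rodham")
demo_rep <- c("Trump, Donald J.", "Christie, Christopher J")
demo_ind <- c("Johnson, Gary")

# Candidates present in the data but absent from every lookup vector
unmatched <- setdiff(unique(az_demo$cand_nm),
                     c(demo_dem, demo_rep, demo_ind))
print(unmatched)
```

An empty result would mean every candidate in the data received a party label.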
I would also like to add other information, such as latitudes and longitudes, for map analysis.
I would also like to integrate population data by zip-code from the 2010 ZCTA census.
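One way such a join could look is sketched below, assuming the nine-digit `contbr_zip` values are first trimmed to five digits (the `clean_zip` column that shows up in later output) and then merged with a per-zip lookup table. The `zip_lookup` table and its coordinates are made-up placeholders, not the actual census or geocoding source.

```r
# Toy contributions with 9-digit and 5-digit zip codes (illustrative rows)
az_demo <- data.frame(
  contbr_zip = c("857507118", "852533610", "85704"),
  stringsAsFactors = FALSE
)

# Keep only the first five digits so zips match the lookup table
az_demo$clean_zip <- substr(az_demo$contbr_zip, 1, 5)

# Hypothetical lookup table; coordinates are placeholders
zip_lookup <- data.frame(
  clean_zip = c("85750", "85253"),
  latitude  = c(32.3, 33.5),
  longitude = c(-110.8, -111.9),
  stringsAsFactors = FALSE
)

# Left join: keep every contribution, even when the zip has no match
merged <- merge(az_demo, zip_lookup, by = "clean_zip", all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps unmatched contributions with `NA` coordinates, which makes gaps in the lookup table easy to spot.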
Now that I have added the candidates’ genders, I’ll add the contributors’ genders using the gender package.
I would like to know how the data is distributed.
Date distribution
It appears that we have negative amounts, going all the way down to -5400. I believe these represent refunds, judging by the most common receipt description on those rows.
I want to find out the amount stats without the refunds.
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     0.04    15.00    27.00    80.21    61.86 10800.00
## 'table' int [1:2224(1d)] 1 1 2 2 1 25 1 861 1 1 ...
## - attr(*, "dimnames")=List of 1
## ..$ non_zero_rec: chr [1:2224] "0.04" "0.12" "0.24" "0.5" ...
## non_zero_rec
##     3     5     8    10    15    19    20    25    27    28    35    38
##  1126  7727  1869 11486  5699  2539  2638 17543  5432  2258  2322  1014
##    40    50    75    80   100   200   250   500
##  2340 13346  1325  2038 11547  1941  4095  1380
I wonder why odd amounts such as 19, 27, or even 38 are so common.
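The refund filter and the frequency table above could be produced along these lines. This is a sketch on made-up amounts: the vector `amounts` is illustrative, and only the name `non_zero_rec` comes from the output shown.

```r
# Toy contribution amounts; negative values stand in for refunds
amounts <- c(25, 50, -25, 100, 27, 27, 0, 19, 25, 25, -5400)

# Keep strictly positive receipts only
non_zero_rec <- amounts[amounts > 0]
print(summary(non_zero_rec))

# Most frequent contribution amounts, largest counts first
amount_counts <- sort(table(non_zero_rec), decreasing = TRUE)
print(head(amount_counts, 3))
```

Sorting the table by count makes the most common amounts stand out without scanning all distinct values.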
Below we will see the number of contributions for each candidate and how they break down:
##   Clinton, Hillary Rodham          Sanders, Bernard
##                     53861                     35784
##          Trump, Donald J. Cruz, Rafael Edward 'Ted'
##                     16087                      7129
##       Carson, Benjamin S.              Rubio, Marco
##                      2954                      1657
##            Fiorina, Carly                Paul, Rand
##                       462                       426
##             Johnson, Gary           Kasich, John R.
##                       318                       263
##               Stein, Jill                 Bush, Jeb
##                       199                       122
##            Huckabee, Mike            McMullin, Evan
##                       101                        98
##             Walker, Scott   O'Malley, Martin Joseph
##                        95                        29
##  Christie, Christopher J.      Santorum, Richard J.
##                        19                        19
##             Jindal, Bobby        Graham, Lindsey O.
##                        10                         9
##     Webb, James Henry Jr.          Lessig, Lawrence
##                         5                         4
##    Perry, James R. (Rick)      Gilmore, James S III
##                         1                         0
I would like to see a box plot of contributions by gender and party. Republicans contributed more on average and had a wider range of contribution amounts. Male Republicans contributed slightly more on average than their female counterparts.
##
## female male
## 65569 54083
I did not expect to find more female contributors than male ones in this data set.
Let’s explore whether females were more likely to contribute to female candidates.
## [1] 0.546188
## [1] 0.653514
In this data set, 54.6% of female contributions went to female candidates, while 65.4% of male contributions went to male candidates, which is a negligible preference.
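The two proportions above can be computed as the mean of a logical comparison. Here is a minimal sketch on made-up rows, using the column names `contrib_gender` and `cand_gender` that appear later in the summary output. The overall share going to female candidates is added as a baseline. That baseline is my own suggestion and was not part of the original output.

```r
# Toy data: one row per contribution (illustrative values)
demo <- data.frame(
  contrib_gender = c("female", "female", "female", "male",
                     "male", "male", "male", "female"),
  cand_gender    = c("Female", "Male", "Female", "Male",
                     "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Share of female contributions that went to female candidates
female_to_female <- mean(
  demo$cand_gender[demo$contrib_gender == "female"] == "Female")

# Share of male contributions that went to male candidates
male_to_male <- mean(
  demo$cand_gender[demo$contrib_gender == "male"] == "Male")

# Baseline: share of all contributions that went to female candidates
overall_female <- mean(demo$cand_gender == "Female")

print(c(female_to_female, male_to_male, overall_female))
```

Comparing each group's share with the baseline helps separate a same-gender preference from the simple fact that some candidates received far more contributions than others.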
##    democrat independent  republican           N
##   80.341880    1.709402   17.948718  117.000000
##   Clinton, Hillary Rodham          Sanders, Bernard
##               59.54716981               36.50314465
## Cruz, Rafael Edward 'Ted'          Trump, Donald J.
##                1.43396226                1.40880503
##       Carson, Benjamin S.              Rubio, Marco
##                0.42767296                0.20125786
##               Stein, Jill                Paul, Rand
##                0.20125786                0.15094340
##        Graham, Lindsey O.            Fiorina, Carly
##                0.05031447                0.02515723
##             Johnson, Gary           Kasich, John R.
##                0.02515723                0.02515723
##                 Bush, Jeb  Christie, Christopher J.
##                0.00000000                0.00000000
##      Gilmore, James S III            Huckabee, Mike
##                0.00000000                0.00000000
##             Jindal, Bobby          Lessig, Lawrence
##                0.00000000                0.00000000
##            McMullin, Evan   O'Malley, Martin Joseph
##                0.00000000                0.00000000
##    Perry, James R. (Rick)      Santorum, Richard J.
##                0.00000000                0.00000000
##             Walker, Scott     Webb, James Henry Jr.
##                0.00000000                0.00000000
Around 80% of colleges had a Democratic preference. The majority of contributions (59.5%) went to Clinton, and Sanders came in second with 36.5%. Cruz came in third (1.43%) and Trump a close fourth (1.41%).
Below I will find the stats of homemakers and retirees
## first_name clean_zip cmte_id cand_id
## Length:1076 Length:1076 Length:1076 P00003392:600
## Class :character Class :character Class :character P60006111:146
## Mode :character Mode :character Mode :character P60007168:105
## P60005915: 99
## P80001571: 84
## P60006723: 14
## (Other) : 28
## cand_nm contbr_nm
## Clinton, Hillary Rodham :600 BORCH, INGER : 39
## Cruz, Rafael Edward 'Ted':146 GUIDARELLI-AMBRAD, DEBORAH: 35
## Sanders, Bernard :105 FRANK, GLORIA : 29
## Carson, Benjamin S. : 99 FRANZ, ROBIN : 29
## Trump, Donald J. : 84 DOVER, RITA : 28
## Rubio, Marco : 14 BADE, KRISTI : 27
## (Other) : 28 (Other) :889
## contbr_city contbr_st contbr_zip contbr_employer
## SCOTTSDALE :212 AZ:1076 857507118: 39 N/A :531
## TUCSON :171 852533610: 35 HOMEMAKER :306
## PHOENIX :143 852043820: 29 RETIRED : 59
## GILBERT : 85 852951792: 29 NONE : 41
## MESA : 83 853021415: 28 NOT EMPLOYED: 39
## PARADISE VALLEY: 57 852543072: 27 MY CHILDREN : 25
## (Other) :325 (Other) :889 (Other) : 75
## contbr_occupation contb_receipt_amt
## HOMEMAKER :1028 Min. : -40.0
## UNEMPLOYED - HOMEMAKER : 25 1st Qu.: 25.0
## HOMEMAKER / PHOTOGRAPHER / MSW: 5 Median : 50.0
## HOMEMAKER/ACTIVIST/ARTIST : 5 Mean : 137.1
## HUSBAND/MECHANICWIFE/HOMEMAKER: 5 3rd Qu.: 100.0
## HOMEMAKER/PHYSICIAN : 3 Max. :2700.0
## (Other) : 5
## contb_receipt_dt
## 19-OCT-16: 14
## 03-NOV-16: 12
## 06-NOV-16: 12
## 09-OCT-16: 12
## 26-SEP-16: 12
## 04-NOV-16: 11
## (Other) :1003
## receipt_desc
## :1076
## * EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING: 0
## * REATTRIBUTED FROM EDWARD FARMILANT : 0
## * REATTRIBUTED TO BARBARA FAMILANT : 0
## * REATTRIBUTED TO VICTORIA STRONG : 0
## EVENT PLANNING REATTRIBUTION FROM SPOUSE : 0
## (Other) : 0
## memo_cd memo_text form_tp
## :906 :872 SA17A:912
## X:170 * EARMARKED CONTRIBUTION: SEE BELOW: 99 SA18 :164
## * HILLARY VICTORY FUND : 98 SB28A: 0
## *BEST EFFORTS UPDATE : 5
## * : 1
## EARMARKED FROM MAKE DC LISTEN : 1
## (Other) : 0
## file_num tran_id election_tp
## Min. :1014598 C5628470 : 2 : 4
## 1st Qu.:1077853 A105C04C73FFA4C859DB: 1 G2016:452
## Median :1109498 A6BF5A3EFECE4468B9E9: 1 O2016: 1
## Mean :1103419 A85C4E16099CC4E5F8A1: 1 P2016:619
## 3rd Qu.:1133930 AAA1CD0DBF8AB4B9281D: 1 P2020: 0
## Max. :1146165 AFCCA0974E8D949428D0: 1
## (Other) :1069
## proper_date party city
## Min. :2015-04-01 democrat :705 Length:1076
## 1st Qu.:2016-02-27 independent: 14 Class :character
## Median :2016-06-21 republican :357 Mode :character
## Mean :2016-05-23
## 3rd Qu.:2016-09-21
## Max. :2016-12-02
##
## state latitude longitude cand_gender
## Length:1076 Min. :31.49 Min. :-114.6 a : 0
## Class :character 1st Qu.:33.30 1st Qu.:-112.1 Female:602
## Mode :character Median :33.49 Median :-111.9 Male :474
## Mean :33.37 Mean :-111.8
## 3rd Qu.:33.62 3rd Qu.:-111.7
## Max. :36.62 Max. :-109.4
##
## contrib_gender
## female:1035
## male : 41
##
##
##
##
##
##      female        male           N
##   96.189591    3.810409 1076.000000
##    democrat independent  republican           N
##   65.520446    1.301115   33.178439 1076.000000
## [1] 35.57178
##    democrat independent  republican        NA's
## 60.00657639  0.44540101 39.52410845  0.02391415
## [1] 81.05961
About 96% of homemakers are female, and about 65.5% of homemakers contributed to Democrats.
As we can see above, retirees make up about 35.6% of the data set. Around 60% of retirees contributed to Democrats and around 40% to Republicans. Contributions to independents are negligible. Retirees contributed $81 on average.
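The percentage breakdowns with an `N` column shown above look like the output of a small helper. Here is a sketch of what such a helper might be, run on made-up rows. The function name `pct_with_n` and the toy data are assumptions, not the original code.

```r
# Percentage breakdown of a categorical vector, with the group size appended
pct_with_n <- function(x) {
  c(prop.table(table(x)) * 100, N = length(x))
}

# Toy data: one row per contribution (illustrative values)
demo <- data.frame(
  contbr_occupation = c("RETIRED", "HOMEMAKER", "RETIRED", "ENGINEER",
                        "RETIRED", "HOMEMAKER", "TEACHER", "RETIRED"),
  party = c("democrat", "democrat", "republican", "republican",
            "democrat", "republican", "democrat", "democrat"),
  contb_receipt_amt = c(50, 25, 100, 250, 30, 75, 20, 20),
  stringsAsFactors = FALSE
)

retired <- demo[demo$contbr_occupation == "RETIRED", ]

# Retirees as a percentage of all contributions
retired_share <- nrow(retired) / nrow(demo) * 100

# Party breakdown and average contribution among retirees
retired_party <- pct_with_n(retired$party)
retired_mean  <- mean(retired$contb_receipt_amt)

print(retired_share)
print(retired_party)
print(retired_mean)
```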
I want to know which occupations are the most politically active and how they lean politically.
## [,1]
## RETIRED 32749
## NOT EMPLOYED 13737
## INFORMATION REQUESTED 3214
## ATTORNEY 2107
## PHYSICIAN 1897
## TEACHER 1821
## ENGINEER 1389
## CONSULTANT 1272
## PROFESSOR 1239
## SALES 1238
Setting aside retirees, the not employed, and entries where the occupation was missing ("INFORMATION REQUESTED"), the most politically active occupations in the data set are attorneys, physicians, then teachers.
Below I would like to know the proportion of party leaning for each job. For example, of all engineers, what percentage lean Republican (number of Republican engineers / total number of engineers)?
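The per-occupation proportion described above is a row-wise proportion of a two-way table. Here is a minimal sketch on made-up rows, assuming columns named `contbr_occupation` and `party` as in the rest of the analysis.

```r
# Toy data: one row per contribution (illustrative values)
demo <- data.frame(
  contbr_occupation = c("ENGINEER", "ENGINEER", "ENGINEER", "ENGINEER",
                        "TEACHER", "TEACHER"),
  party = c("republican", "republican", "republican", "democrat",
            "democrat", "democrat"),
  stringsAsFactors = FALSE
)

# Counts of contributions per occupation and party
counts <- table(demo$contbr_occupation, demo$party)

# margin = 1 divides each row by its total, so each occupation sums to 1
party_share <- prop.table(counts, margin = 1)
print(round(party_share, 2))
```

In this toy table, three of the four engineer rows are Republican, so the engineer/republican cell is 0.75.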
I would like to see average spending along dates
The average contribution amount looks huge at the beginning of 2015, but once I added a fourth variable (n = number of contributions), it became clear that these were a few outliers. The bulk of the contributions came in mid-2016, when the average was lower but the number of contributions (n) was substantially larger.
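The effect described above, a high average driven by very few rows, is easy to check by computing the mean and the count side by side. Here is a sketch on made-up rows, assuming the date column is `proper_date` as in the summary output. Grouping by month is my own choice for illustration.

```r
# Toy data: one early large contribution, several small later ones
demo <- data.frame(
  proper_date = as.Date(c("2015-04-01", "2016-06-03", "2016-06-15",
                          "2016-06-21", "2016-07-02")),
  contb_receipt_amt = c(2700, 25, 50, 15, 100)
)

# Group by calendar month
demo$month <- format(demo$proper_date, "%Y-%m")

mean_by_month <- tapply(demo$contb_receipt_amt, demo$month, mean)
n_by_month    <- tapply(demo$contb_receipt_amt, demo$month, length)

# One row per month: average amount next to the number of contributions
monthly <- data.frame(
  month    = names(mean_by_month),
  mean_amt = as.vector(mean_by_month),
  n        = as.vector(n_by_month)
)
print(monthly)
```

Here the 2015-04 average is the largest but rests on a single contribution, which is the pattern the plot revealed.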
Note: there is a discrepancy in the color scale:
Brain storming: What can I do to improve?
What kind of graphs could I add? -Bar chart of most contributing jobs to the Donald
I am wondering what kinds of jobs contributed to Donald Trump. My intuition says it’s mostly blue-collar jobs. Let’s find out!
## [1] Trump, Donald J. Sanders, Bernard
## [3] Cruz, Rafael Edward 'Ted' Clinton, Hillary Rodham
## [5] Stein, Jill Carson, Benjamin S.
## [7] Paul, Rand Fiorina, Carly
## [9] Rubio, Marco Johnson, Gary
## [11] Bush, Jeb Kasich, John R.
## [13] Santorum, Richard J. McMullin, Evan
## [15] Webb, James Henry Jr. Huckabee, Mike
## [17] Walker, Scott Christie, Christopher J.
## [19] Jindal, Bobby O'Malley, Martin Joseph
## [21] Lessig, Lawrence Graham, Lindsey O.
## [23] Perry, James R. (Rick)
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.
My hypothesis is false: most of Trump’s contributors have white-collar jobs, including higher-income ones such as engineers, consultants, physicians, and CEOs. One weakness of this plot is that it does not represent low-income contributors.
Let me look at the number of contributions only, to see whether it helps me find out more.
## [1] 0.1344482
Trump’s share of all contributions in the data set is about 13.4%.
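As a quick check of that share, it can be recomputed from counts printed earlier: Trump’s 16087 contributions, and the total obtained by adding the female and male counts from the gender table.

```r
# Counts taken from the tables printed earlier in this analysis
trump_n <- 16087          # contributions to Trump, Donald J.
total_n <- 65569 + 54083  # female + male contributions

trump_share <- trump_n / total_n
print(round(trump_share, 4))
```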
Even after changing some of the subset filters, the majority were still high-income occupations. Some blue-collar jobs such as truck driver and construction did appear, but they were the minority. My hypothesis is that blue-collar workers cannot afford to contribute and are therefore underrepresented in this data set.
I want to see whether higher-income zip codes had more contributions, and I will use a scatter plot to demonstrate. A strong relationship only appears when I subset the data to zip codes with at least 100 contributions. Otherwise the data is skewed and the relationship is not apparent.
There is a strong correlation at first, but then as population increases the relationship weakens.
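The zip-level subsetting can be sketched as follows, assuming the zip-level variable is the population figure merged from the census table. The same pattern would apply to an income column. The toy rows, the `zip_pop` table, and the low threshold are illustrative only. The analysis itself used a minimum of 100 contributions.

```r
# Toy data: one row per contribution, identified by 5-digit zip
demo <- data.frame(
  clean_zip = c(rep("85001", 5), rep("85002", 4),
                rep("85003", 3), "85004"),
  stringsAsFactors = FALSE
)

# Number of contributions per zip code
zip_counts <- as.data.frame(table(clean_zip = demo$clean_zip),
                            responseName = "n_contrib",
                            stringsAsFactors = FALSE)

# Hypothetical per-zip population table (made-up numbers)
zip_pop <- data.frame(
  clean_zip  = c("85001", "85002", "85003", "85004"),
  population = c(50000, 30000, 35000, 8000),
  stringsAsFactors = FALSE
)

zip_stats <- merge(zip_counts, zip_pop, by = "clean_zip")

# Keep only zips with enough contributions (100 in the real analysis)
min_contrib <- 3
zip_stats <- zip_stats[zip_stats$n_contrib >= min_contrib, ]

# Strength of the linear relationship among the remaining zips
r <- cor(zip_stats$population, zip_stats$n_contrib)
print(zip_stats)
print(r)
```

Dropping sparsely represented zip codes removes points that are mostly noise, which is why the relationship only shows up after the filter.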
Overall, this project was a good challenge and learning experience. At first, exploring the data was easy and enjoyable, but as I went deeper into the analysis it became harder to come up with relationships and conclusions about the data. I wanted my analysis to have a central theme or thesis, and not being able to draw a definite conclusion made me feel frustrated.
I was impressed with the versatility of R and its packages. It felt more intuitive than Python, maybe because I have a background with Alteryx. R seemed to have less support on Stack Overflow than Python, but the support that exists aided me significantly throughout the project. I also used DataCamp to fill in knowledge gaps and reinforce the concepts learned in the Udacity curriculum. I did not use Udacity’s live help as much as in the other projects, because my problems were not with the programming itself but with running out of ideas and direction for my analysis.
In terms of visualizations, R is fantastic for data exploration, although I had trouble exporting high-resolution plots from it. I feel that Tableau is more suitable for final, conclusive plots.